Mechanics of Bivariate Regression
Why least squares?
Application to causality
No causality
No parameters
Regression/Least Squares as Algorithm
Association between two variables
The Mean
Regression/Least Squares extends the mean
How are these variables associated?
Covariance
\[Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\]
\[Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\]
Variance
Variance is also the covariance of a variable with itself:
\[Var(X) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})^2\]
\[Var(X) = \overline{x^2} - \bar{x}^2\]
Covariance of tree width and tree height:
x1 = trees$Girth
y1 = trees$Height
mean(x1*y1) - (mean(x1)*mean(y1))
## [1] 10.04839
Covariance of tree width and timber volume:
x2 = trees$Girth
y2 = trees$Volume
mean(x2*y2) - (mean(x2)*mean(y2))
## [1] 48.27882
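As a side check (an addition, not from the original slides): R's built-in cov() uses the \(\frac{1}{n-1}\) denominator, so rescaling it by \(\frac{n-1}{n}\) reproduces the \(\frac{1}{n}\) formula above exactly:

n = length(x2)
cov(x2, y2) * (n - 1) / n  # rescale sample covariance to the 1/n version
## [1] 48.27882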
Why is the \(Cov(Girth,Volume)\) larger?
Scale of covariance reflects scale of the variables.
Can’t directly compare the two covariances
Covariance
\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
Pearson Correlation
\(r(X,Y) = \frac{Cov(X,Y)}{SD(X)SD(Y)}\)
Dividing by product of standard deviations scales the covariance
\(|Cov(X,Y)| \leq \sqrt{Var(X) \cdot Var(Y)}\)
Correlation coefficient must be between \(-1\) and \(1\)
At \(-1\) or \(1\), all points are on a straight line
Negative value implies an increase in \(X\) is associated with a decrease in \(Y\).
If the correlation is \(0\), what must the covariance be?
If \(Var(X)=0\), then \(Cor(X,Y) = ?\)
Correlation of \((x,y)\) is same as correlation of \((y,x)\)
Values closer to -1 or 1 imply “stronger” association
Correlation is not causation
In groups of 2-3:
Without using the cor(), cov(), or var() functions in R, compute the correlation between two variables.
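A minimal sketch of one solution, assuming the task is to reproduce the Pearson correlation by hand (Girth and Height are an illustrative choice; the exercise's exact variables aren't preserved here):

x = trees$Girth
y = trees$Height
cov_xy = mean(x * y) - mean(x) * mean(y)  # 1/n covariance, as defined above
sd_x = sqrt(mean(x^2) - mean(x)^2)        # 1/n standard deviation
sd_y = sqrt(mean(y^2) - mean(y)^2)
cov_xy / (sd_x * sd_y)                    # equals cor(x, y): the 1/n factors cancel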
Why are we always squaring differences?
Variance
\(\frac{1}{n}\sum\limits_{i = 1}^{n} (x_i - \bar{x})^2\)
Covariance
\(\frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)
Mean Squared Error
\(\frac{1}{n}\sum\limits_{i=1}^n(\hat{y_i} - y_i)^2\)
What is the distance between two points?
In \(2\) dimensional space: \((p_1,p_2)\), \((q_1,q_2)\)
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\]
In \(k\) dimensional space: \((p_1,p_2, \ldots, p_k)\), \((q_1,q_2, \ldots ,q_k)\)
\(d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_k - q_k)^2}\)
What is the distance between two points?
\(p = (3,0); q = (0,4)\)
\[d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}\] \[d(p,q) = \sqrt{(3 - 0)^2 + (0 - 4)^2}\] \[d(p,q) = \sqrt{3^2 + (-4)^2} = \ ?\]
What is the distance between two points?
\[d(p,q) = \sqrt{9 + 16} = \sqrt{25} = 5\]
The mean minimizes the variance.
If we observe values of \(Y\), \(y_i\), and choose a single number, \(\hat{y}\), to be our estimate for each value of \(y_i\):
the mean is the estimate that minimizes the distance between \(\hat{y}\) and each of the \(y_i\)s.
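A quick numerical check (a sketch added here, not from the original slides): search over candidate values of \(\hat{y}\) for the one minimizing the sum of squared distances, and compare it to the mean:

y = trees$Height
# find the single number m that minimizes the sum of squared deviations
optimize(function(m) sum((y - m)^2), interval = range(y))$minimum
mean(y)  # agrees with the minimizer (up to numerical tolerance)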
Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.
\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\]
a vector is a one-dimensional array of numbers of length \(n\).
can be portrayed graphically as an arrow from the origin to a point in \(n\) dimensional space
can be multiplied by a number to extend/shorten their length
We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values in our vector \(y\).
This is equivalent to the following:
\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix} \approx \begin{pmatrix}\hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix}1 \\ 1 \end{pmatrix}\]
Multiplying the \(\begin{pmatrix}1 \\ 1\end{pmatrix}\) vector by a constant traces out a line.
Choose \(\hat{y}\) so that the point \(\hat{y}\begin{pmatrix}1 \\ 1\end{pmatrix}\) on that line minimizes the distance to \(y\).
\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\)
can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):
\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)
and another vector \(\mathbf{e}\), which is the difference between the vector of observations and the prediction vector:
\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)
This means our goal is to minimize the length of \(\mathbf{e}\).
How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:
\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]
When is the length of \(\mathbf{e}\) minimized?
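Because the square root is increasing, minimizing the length is the same as minimizing the squared length; setting its derivative to zero (a worked step added for clarity):
\[\frac{d}{d\hat{y}}\left[(3-\hat{y})^2 + (5 - \hat{y})^2\right] = -2(3-\hat{y}) - 2(5-\hat{y}) = 4\hat{y} - 16 = 0\]
\[\Rightarrow \hat{y} = 4 = \frac{3 + 5}{2} = \bar{y}\]
The distance-minimizing prediction is the mean.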
Values of a sample of size \(n\) are represented by a vector in \(n\) dimensional space.
We choose a value \(\hat{y}\) in a one-dimensional subspace (typically on the line through the ones vector \(\begin{pmatrix}1 & 1 & \cdots & 1\end{pmatrix}\))
such that \(\hat{y}\) (times the ones vector) minimizes the distance to \(y\).
The mean is useful…
… but often we want to know if the mean of something \(Y\) is different across different values of something else \(X\).
To put it another way: the mean of \(Y\) is \(E(Y)\) (if we are talking about random variables). Sometimes we want to know \(E(Y | X)\)
We are interested in finding some conditional expectation function (Angrist and Pischke)
expectation: because it is about the mean - \(E(Y)\)
conditional: because it is conditional on values of \(X\) … \(E[Y |X]\)
function: because \(E(Y) = f(X)\), there is some relationship we can look at between values of \(X\) and \(E(Y)\).
\[E[Y | X = x]\]
There are many ways to get the conditional expectation function
or, by convention:
\[E(Y | X) = a + b\cdot X\]
The regression line, or the fit using least squares, closely approximates the conditional mean of son's height (\(Y\)) across values of father's height (\(X\)).
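The father/son data aren't reproduced here, but the same idea can be sketched with the trees data (an illustrative substitution): bin \(X\) and take the mean of \(Y\) within each bin, a rough version of \(E[Y | X = x]\):

bins = cut(trees$Girth, breaks = 5)  # bin X into 5 intervals
tapply(trees$Volume, bins, mean)     # mean of Y within each bin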
How do we obtain this line mathematically?
We can do it the same way we obtained the mean!
We are going to choose an intercept \(a\) and slope \(b\) such that:
\(\hat{y}_i = a + b \cdot x_i\)
and that minimizes the distance between the fitted (\(\hat{y}_i\)) and true (\(y_i\)) values:
\(\sqrt{\sum\limits_i^n (y_i - \hat{y_i})^2}\)
Another way of thinking of this is in terms of residuals, or the difference between true and fitted values using the equation of the line.
\(e_i = y_i - \hat{y_i}\)
Minimizing the distance also means minimizing the sum of squared residuals \(e_i\)
We need to solve this equation:
\[\min_{a,b} \sum\limits_i^n (y_i - a - b x_i)^2\]
Choose \(a\) and \(b\) to minimize this value, given \(x_i\) and \(y_i\)
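Before turning to the calculus, a numerical sketch (an addition, not from the original slides): optim() can minimize the sum of squared residuals directly, and the closed-form derivation below gives the same answer:

x = trees$Girth
y = trees$Volume
ssr = function(p) sum((y - p[1] - p[2] * x)^2)  # p = c(a, b)
optim(c(0, 0), ssr)$par  # numerically chosen (a, b)
coef(lm(y ~ x))          # least squares fit, for comparison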
We can do this with calculus: solve for where the first derivative is \(0\)
First, we take the derivative with respect to \(a\), which yields:
\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) \right] = 0\)
\(\sum\limits_i^n y_i - \sum\limits_i^n a - \sum\limits_i^n b x_i = 0\)
\(-\sum\limits_i^n a = -\sum\limits_i^n y_i + \sum\limits_i^n b x_i\)
\(\sum\limits_i^n a = \sum\limits_i^n y_i - b \sum\limits_i^n x_i\)
Dividing both sides by \(n\), we get:
\(a = \bar{y} - b\bar{x}\)
Where \(\bar{y}\) is mean of \(y\) and \(\bar{x}\) is mean of \(x\).
Implication: regression line goes through the point of averages \(\bar{y} = a + b \bar{x}\)
Next, we take derivative with respect to \(b\):
\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) x_i\right] = 0\)
\(\sum\limits_i^n (y_i - (\bar{y} - b\bar{x}) - b x_i) x_i = 0\)
\(\sum\limits_i^n \left(y_ix_i - \bar{y}x_i + b\bar{x}x_i - b x_i^2\right) = 0\)
\(\sum\limits_i^n (y_i - \bar{y})x_i = b\sum\limits_i^n (x_i - \bar{x})x_i\)
Dividing both sides by \(n\) gives us:
\(\frac{1}{n}\sum\limits_i^n \left(y_ix_i - \bar{y}x_i\right) = b \cdot \frac{1}{n}\sum\limits_i^n \left(x_i^2 - \bar{x}x_i\right)\)
\(\overline{yx} - \bar{y}\bar{x} = b\left(\overline{x^2} - \bar{x}^2\right)\)
\(Cov(y,x) = b \cdot Var(x)\)
\(\frac{Cov(y,x)}{Var(x)} = b\)
\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]
\[a = \overline{y} - \overline{x}\cdot b\]
This shows that at \(\bar{x}\), the fitted line passes through \(\bar{y}\): the regression line (of predicted values) goes through the point of averages \((\bar{x}, \bar{y})\).
There are other ways to derive least squares.
The math of regression ensures that:
\(1\). The mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(a\)), the regression line goes through the point of averages, which forces the residuals to average to zero. This is also true of the residuals from the mean.
\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares. We will see why this is next week. But we can also prove it in class.
\(3\). \(Var(X) > 0\) in order to compute \(a\) and \(b\). Why is this?
Using the trees data in R:
Take \(Y = Volume\) and \(X = Girth\)
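A minimal sketch of this computation (a reconstruction, since the slide's code isn't preserved here): compute \(b\) and \(a\) from the formulas above, compare with lm(), and check facts \(1\) and \(2\):

x = trees$Girth
y = trees$Volume
b = cov(x, y) / var(x)  # the 1/(n-1) factors cancel in the ratio
a = mean(y) - b * mean(x)
c(a = a, b = b)
coef(lm(y ~ x))     # matches: (Intercept) = a, slope = b
e = y - (a + b*x)   # residuals
mean(e)             # fact 1: zero (up to floating point)
cov(x, e)           # fact 2: zero (up to floating point)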
These facts are mathematical truths about least squares, unrelated to the assumptions needed for statistical/causal inference.
Can fit least squares to any scatterplot (regardless of how sensical it is), if \(x\) has positive variance.
Least squares line minimizes the sum of squared residuals (minimizes the distance between predicted values and actual values of \(y\)).
Least Squares line always goes through point of averages; can be computed exactly from “graph of averages”
Residuals \(e\) are always uncorrelated with \(x\) if there is an intercept, because they are orthogonal to \(x\) and have a mean of \(0\).